ABSTRACT

AN UNRESTRICTED TEXT-TO-SPEECH ALGORITHM 
FOR THE VOTRAX SYNTHESIZER 

Stephen Bradly Stein 



A new text-to-speech algorithm for the Votrax synthesizer,
based on stress, is described. It converts a written word
into speech in two stages: the assignment of stress and the
translation of the graphemes into phonemes.


Stress assignment is based on internal affixes and syllable
count. The external affixes, which do not affect stress, are
removed. Syllable count is determined from the number of
vowels, diphthongs and silent "e"s detected. Following this,
the graphemes are translated into their phonetic
representation via letter-to-sound rules. A dictionary is
used for words which do not obey these rules. This method
gives 100% correct pronunciation to the most frequent 5,000
words in the Brown corpus.


An experiment was conducted to test the intelligibility of
the system against a human voice and to see if it would do
better than a much simpler system: the "Type 'N Talk"
synthesizer, incorporating its own algorithm without stress
assignment, produced by the same manufacturer as our Votrax
synthesizer. Two different sets of material were used: a
paragraph from Time magazine and nine lists of ten
phonetically-balanced sentences.

Our new algorithm scored much higher on the intelligibility
test than the "Type 'N Talk" synthesizer. The maxima of
words correctly comprehended were: 27% for the "Type 'N
Talk", 66% for the new algorithm and 96% for human speech.

An analysis of variance indicated that the increase in
intelligibility observed for all systems over time was
significant to the 0.01% level. The results for the
paragraph were much better, indicating the value of
contextual information.

CHAPTER I

INTRODUCTION 

Speech is one of the fundamental modes of human
communication. Even in countries where a high rate of
illiteracy exists, the prime mode of the inhabitants'
communication is via speech.




In one study by Ochsman and Chapanis [14], sixty groups,
each consisting of two persons, were given a problem to
solve co-operatively. They were allowed to communicate only
in the manner of communication assigned to them.
Interestingly enough, those groups instructed to communicate
only by handwriting or typewriting took more than twice as
long to solve the problem as did the groups utilizing only
vocal communication. Furthermore, most of the time consumed
by the speech communicating groups was spent in active
communication.


The researchers drew two general conclusions which I
believe everyone knows intuitively. Whether a person is
sending or receiving a message, speech communication allows
other activities to be done simultaneously. Deeper
concentration on the communication task is required when
using hard copy methods, thereby increasing the mean time to
solve a problem.


It is therefore reasonable to conclude that adding the
capability of speech to machines would greatly improve the
man-machine interface.


In this thesis I examine one aspect of this problem, speech
synthesis by machine. No doubt, synthetic speech has many
applications, as seen presently with the proliferation of
talking toys, watches, calculators, car instrument panels
and yes, even microwave ovens. Speech adds a novel touch to
these devices, although seldom altering their functional
capabilities.


Synthetic speech unfolds at least one salient possibility.
It allows the blind as well as various other disabled
individuals access to the written word via reading machines,
without transformation or alteration of the text. Synthetic
speech also shows promise as a teaching method for our
children.


The research in this thesis has been to implement a
synthetic speech system, using a commercially available
synthesizer, that would have an unlimited vocabulary. Upon
completion, it could be used as part of a reading machine
for the blind and other applications as cited above.


Methods of Synthetic Speech Production 




There exist numerous techniques to produce computer speech,
varying both in quality and complexity of implementation.


One of the first and easiest ways is to record the speech
utterances on a device similar to an audio tape recorder.
The speech is taped on one track of the tape and start and
stop marks are placed on another. When the computer wishes
to verbalise a point it simply reads the tape at high speed
until it finds the start mark of the sentence or phrase it
wants. Once the phrase is found, it places the tape recorder
in play mode and reads the other track for the stop mark.
Once found, it stops the tape recorder.




Certainly this method is crude, but it can produce highly
intelligible and natural sounding speech. The word "can" is
used because this is only true if the entire phrase is
recorded as a whole, as opposed to the machine stringing
together words recorded separately. In the latter case, the
speech will sound lifeless and dull due to the lack of
control of the suprasegmental aspects of the speech, such as
intonation over the entire utterance.


Other potential problems of this method include the lack of
electromechanical reliability and the degradation of speech
quality as the magnetic media wears out.


A variation of this system was the IBM 7^0 Drum System 
which used a magnetic drum rather than a tape.

Digital Speech 


Another relatively simple and much more reliable method
involves digitizing the speech and storing it in solid state
memory. This is accomplished by playing the speech through
an analog to digital converter with the output directed to a
computer. When the computer wishes to speak, it simply
invokes the reverse process, taking the digital
representation of the speech from its memory and playing it
back through a digital to analog converter with the output
driving an ordinary audio speaker.



Such a technique produces high quality speech with a high
degree of reliability. Unfortunately, due to a required bit
rate of approximately 30 to 100k bits per second, an immense
memory is needed, which proves practical only for the
smallest of vocabularies.

A speech wave, being composed of a few basic repeating
waveforms, is highly redundant. Linear predictive coding
makes use of this fact: it eliminates the superfluous and
stores only the essential data in the form of linear
predictive coefficients.


This method still produces high quality speech utilizing
much less memory than the preceding method. The bit rate
used can be as low as 1200 bits per second, but 2400 bits
per second would be the norm.
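
To give a feel for what these bit rates mean in practice,
the following minimal sketch computes the memory needed to
hold one minute of speech at each rate quoted above. The
function name storage_bytes and the one-minute duration are
assumptions chosen for illustration; they do not come from
the original work.

```python
def storage_bytes(bits_per_second, seconds):
    """Bytes needed to hold `seconds` of speech at a given bit rate."""
    return bits_per_second * seconds // 8


# Bit rates quoted in the text; the 60-second duration is an assumed example.
RATES = {
    "digitized speech (low)": 30_000,
    "digitized speech (high)": 100_000,
    "linear predictive coding (low)": 1_200,
    "linear predictive coding (typical)": 2_400,
}

for name, rate in RATES.items():
    print(f"{name}: {storage_bytes(rate, 60)} bytes per minute")
```

At these rates a minute of directly digitized speech needs
hundreds of kilobytes, while the same minute under linear
predictive coding needs on the order of ten to twenty
kilobytes.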


All the methods discussed so far are restricted to a finite
vocabulary. They can only produce utterances that have been
predetermined. Additionally, each new utterance added to the
computer's repertoire would require additional memory space.


In certain applications, such as toys, test equipment,
flight simulators, weather reports, talking clocks, and the
like, where the vocabulary is fixed and of a moderate
length, these limitations are not important. It probably
makes good sense to use linear predictive coding for these
applications, considering chips are available from Texas
Instruments especially for this purpose. (One such chip is
the TMC028TNL, a digital signal processor containing timing
and decoding circuits, a 10 pole digital lattice filter and
a digital to analog converter. Another chip is the
controller chip TMC0271HL, which does the mathematical
calculations required in linear predictive coding.)


One final word about the above methods is in order. If it
is decided to make additions to the vocabulary, the new
recording must be done by the same individual who originally
recorded the words. If that individual is not available,
then the entire vocabulary must be redone!

There exist many applications where an unlimited speech
vocabulary is needed, such as in reading machines for the
blind, large and varied data bases, or for a truly flexible
man-machine interface. (Of course, one would also need
unlimited speech recognition.)

The Human Vocal Tract 

It would seem reasonable that one way to obtain unlimited
speech would be to produce a synthesizer based on the human
vocal tract.

Humans produce sounds by creating periodic and random
acoustic excitation within the vocal tract. As air from the
lungs is forced through, the vocal cords begin to vibrate,
causing periodic acoustic energy which is termed 'voiced'
energy. Any sound which is produced with the vocal cords
vibrating is termed voiced, as in the long "v" sound.
Conversely, any sound produced with the vocal cords open is
referred to as 'unvoiced', such as the long "f" sound. This
air then passes over the articulators (i.e. teeth, tongue,
lips, etc.) to produce random acoustic excitation or, as it
is more commonly called, frication.


One must realize that the entire vocal tract consists of
many resonant cavities which act as filters. As the acoustic
energy passes through, peaks, commonly called formants, will
be formed at or near the resonant frequency of that part of
the vocal tract. Of the series of formants formed, only the
three lowest in frequency need be varied to produce
intelligible speech.


For each new sound that one wishes to utter the
articulators must be repositioned. As this repositioning
occurs, the frequency response of the vocal tract will
change smoothly, rather than abruptly, because of the
articulators' smooth movement from one state to another.
This means that each sound produced is influenced by what
occurred before it (dynamic articulation).

The Votrax Synthesizer

From the above brief description of human sound production
we can view it as nothing more than a series of filters and
acoustic energy sources.

The Votrax synthesizer uses this fact to become an
electronic analog of the human vocal tract. It consists of
two sound generator circuits. One produces voiced sounds
and the other produces fricative sounds. These two outputs
are joined and passed through a set of filters to simulate
the vocal tract's resonance.

As it stands, the parameters driving the above circuits
would have to be updated every 5-25 milliseconds. However,
the Votrax has some additional circuitry, the parametric
control unit, that eliminates the need to update these
parameters. This circuit controls and updates all the
parameters needed to produce any of sixty-one phonemes.
(Votrax literature, for models VS-6.0 and SC-01, says
sixty-three, but two of them are pauses.)

Finally, there exists a dynamic articulation control unit
between the parametric control unit and the remainder of the
circuitry. It serves to modify the parametric control output
to account for the movement of the articulators between
phonemes.

The entire synthesizer is hardware controlled and all that
is required to run it is a stream of phonetic codes
represented in six bit words. Certain Votrax models allow
other parameters to be controlled. Model VS-6.0, the one
which is used for this research, has four set inflection
levels which change the phonemes' pitch and amplitude,
requiring an additional two bits. Therefore, the entire
command word has eight bits.
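
As a minimal sketch of how such an eight bit command word
could be assembled, the code below packs a six bit phoneme
code and a two bit inflection level into one byte. The
placement of the inflection bits in the two high-order
positions is an assumption made for illustration; the text
does not give the actual VS-6.0 bit layout, and the function
names are hypothetical.

```python
PHONEME_BITS = 6  # six bit phonetic code, as stated in the text


def pack_command(phoneme_code, inflection):
    """Combine a phoneme code (0-63) and inflection level (0-3) into a byte.

    Assumed layout: inflection in the top two bits, phoneme in the low six.
    """
    if not 0 <= phoneme_code < 64:
        raise ValueError("phoneme code must fit in six bits")
    if not 0 <= inflection < 4:
        raise ValueError("inflection level must fit in two bits")
    return (inflection << PHONEME_BITS) | phoneme_code


def unpack_command(command):
    """Split a command byte back into (phoneme_code, inflection)."""
    return command & 0x3F, command >> PHONEME_BITS


print(hex(pack_command(5, 2)))
```

Whatever the real layout, the point stands that one byte per
phoneme is all the synthesizer needs from the host.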

The bit rate needed to drive these synthesizers is only
about 70 bits per second and it uses much less memory than
any of the other methods.

One drawback of this synthesizer is that the sound quality
is not nearly as good as that of the other methods, although
it is intelligible. The lack of software control of the
pitch, speech rate, and amplitude is a serious shortcoming:
it makes controlling the suprasegmental aspects of speech
nearly impossible.

Nevertheless, the Votrax synthesizer was chosen for this
research because it requires a low bit rate to operate and,
with suitable programming, it can be made to speak unlimited
English. Although it was nearly impossible to synthesize the
prosodic elements of speech using this synthesizer, it would
nonetheless allow for the production of a small unlimited
speech synthesis system.


Generating the set of rules that such programming requires
is no easy task. It is well known that the orthographic
representation of English words does not always coincide
with their correct pronunciation. One letter may map to many
different sounds. It is this problem that this thesis
addresses. (See chapter 3 for the algorithm.)

Although the software was developed on a large mainframe,
CDC's Cyber 172, there is no reason why it cannot be made to
operate on some of today's newer microcomputers. Although a
Votrax VS-6.0 was used, Votrax produces a single chip
synthesizer, the SC-01, which is virtually identical to the
VS-6.0 and can be used instead.






CHAPTER II

EXISTING SYSTEMS

There exist a number of speech by rule systems, some of
them simple, others far more complicated.

The methods employed by humans to produce speech from its
orthographic representation are not fully understood.
Furthermore, a system that employed all these things, if
they were all known, would surely be of an incredibly large
size. The particular problem that presents itself is the
selection of an algorithm of a suitable size to be run on a
small computer without overtaxing its processing
capabilities. If this criterion cannot be met, then it is
most likely the resultant system would be far too expensive
for the average person to acquire.

With the availability of small inexpensive microprocessors
it would be sufficient to have an algorithm that could run
on one of these and dedicate the microprocessor to this
particular application.



The problem remains to find an algorithm that is
sufficiently compact to run on one of these microprocessors,
yet produces acceptable synthetic speech results using the
Votrax synthesizer.

Elovitz et al [5], at the Naval Research Laboratory in
Washington DC, wrote a very simple program to run the Votrax
VS-6.0 synthesizer. The entire method is based on 329
letter to sound rules exclusively. In this approach, the
text is scanned from left to right, and for each character
scanned, the rules are sequentially searched until a rule
that is relevant to that character is found. Once a
relevant rule is found, the phonetic sound, in IPA, for that
letter or letters is sent to a buffer. There is no
provision for exceptions; a rule must be incorporated into
the list of rules for each exception.

Lastly, the output of these rules is passed through
another set of rules, similar to the letter to sound rules,
that translate the IPA symbols to the Votrax phonetic codes.
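
The left-to-right scan with a sequential rule search can be
sketched as follows. This is only an illustration of the
general technique: the handful of rules, the rule format
(left context, letters, right context, output) and the
phoneme labels are invented here and are not taken from
Elovitz's 329 rules. The second stage, which maps the
symbols to Votrax codes, is omitted.

```python
# Each rule: (left context, letters to match, right context, output symbols).
# An empty context matches anything; "#" stands for a word boundary.
# These rules and phoneme labels are hypothetical, for illustration only.
RULES = [
    ("", "sh", "", ["SH"]),
    ("", "e", "#", []),        # word-final "e" is silent
    ("", "a", "", ["AE"]),
    ("", "e", "", ["EH"]),
    ("", "i", "", ["IH"]),
    ("", "c", "", ["K"]),
    ("", "p", "", ["P"]),
    ("", "s", "", ["S"]),
    ("", "t", "", ["T"]),
]


def translate(word, rules=RULES):
    """Scan left to right; at each position apply the first rule that fits."""
    text = "#" + word.lower() + "#"
    pos = 1
    output = []
    while pos < len(text) - 1:
        for left, letters, right, symbols in rules:
            end = pos + len(letters)
            if (text.startswith(letters, pos)
                    and text[:pos].endswith(left)
                    and text.startswith(right, end)):
                output.extend(symbols)
                pos = end
                break
        else:
            pos += 1  # no rule applies: skip the letter
    return output


print(translate("ship"))
print(translate("cape"))
```

Because the first matching rule wins, the order of the rules
matters: the silent "e" rule must precede the general "e"
rule, which is how an exception is expressed in a system
with no separate exception list.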

Such a system is very compact and easily implemented on
practically any microcomputer. The drawback is that it is
not all that accurate and suffers from a lack of
intelligibility. One particular problem with the system is
the lack of stress rules. Correct pronunciation of English
words necessitates stressing the correct syllable of
polysyllabic words. Interestingly enough, the system on
which Elovitz first based his approach, Ainsworth's [1,2],
does have such rules. However, Ainsworth used a parametric
synthesizer which gave him control over the actual frequency
each phoneme was given and its duration. Elovitz used the
Votrax synthesizer and could not accomplish this in the same
manner. In part, stress can be realized on the Votrax (see
chapter 3), but perhaps the method is not readily apparent.


In any case, Ainsworth's stress rules are very simplistic
and far from complete. Stress is determined in the following
manner. Each word is checked against a list of closed class
words (articles, prepositions, conjunctions, etc.), and if
it is found in this list no stress is assigned. If not
found, it is then checked to see if it contains a prefix; if
so, its second syllable is stressed, otherwise the first
syllable is stressed.
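
The decision procedure just described is small enough to
sketch directly. The word list and prefix list below are
short hypothetical stand-ins, not Ainsworth's actual tables,
and the prefix test is a plain string comparison.

```python
# Hypothetical, deliberately tiny tables for illustration.
CLOSED_CLASS = {"a", "an", "the", "of", "in", "on", "to", "and", "but", "or"}
PREFIXES = ("re", "un", "de", "pre", "con")


def stressed_syllable(word):
    """Return None (no stress), 2 (prefixed word) or 1 (all other words)."""
    w = word.lower()
    if w in CLOSED_CLASS:
        return None
    if any(w.startswith(p) and len(w) > len(p) for p in PREFIXES):
        return 2
    return 1


for example in ("the", "return", "table"):
    print(example, stressed_syllable(example))
```

A string test of this kind also fires on words such as "red"
that merely begin with the same letters as a prefix, which
is one reason the method is far from complete.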


This is hardly an accurate method for polysyllabic words,
but is acceptable for mono and many bisyllabic words. It
should be noted that this method does not necessitate doing
syllabification, but merely looking for a vowel.


Ainsworth also included pauses between breath boundaries.
Breath boundaries were defined at: 1) punctuation marks;
2) preceding a conjunction; 3) between a noun phrase and a
verb phrase; 4) before a prepositional phrase; 5) before a
noun phrase; and 6) after 50 characters have appeared in the
input without meeting any of the other conditions. As one
may have expected, Ainsworth's system, like Elovitz's, is
not too intelligible.

Both of the above programs suffer from a lack of intonation
control. Intonation is an essential part of the English
language and in some cases it alone determines the meaning
of a sentence.

A recent paper by Witten [23] describes a system that
provided for intonation and rhythm of the sentence. The
rhythm is assigned by a complicated set of rules and a
look-up table. Intonation must be marked by the person
entering the text. That is, the text must be input
phonetically; syllable, word and phrase, and all boundaries,
as well as pitch (intonation), must be marked by hand
(1,1+,1-,2,etc). Clearly, this is not automatic in that it
relies too heavily on the knowledge of the person typing the
text. It is therefore useless in the application of
automatic reading for the blind. In fact, for the program
to be of any use, a program needs to be written that
performs the text to phoneme conversion and assigns the
correct intonation.

There is no doubt that if we are aiming for natural
sounding speech, we must account for all the factors
involved in a native speaker's linguistic competence. This
is an extraordinarily difficult task because the
phonological level of language interacts with both the
syntactic and semantic levels. The interaction of these
three systems occurs in such a way that ignoring the higher
level hierarchies would not produce anything approximating
native-speaker English. This involves complicated sets of
rules, mainly because no simple set of adequate rules for
describing either English syntax or semantics exists.
Consequently, any system for speech synthesis designed in
this way is going to be more complicated than the systems
considered thus far.

Furthermore, any system attempting to approximate
native-speaker English, for one reason or the other, must
address problems that the other systems have chosen to
ignore. For example, there is the problem of compound words
which can engender a medial silent "e". The word "scarecrow"
has a silent "e", as does "therefore", but most systems will
pronounce this silent "e" because, by virtue of the
compounding that has occurred, the "e" is no longer
considered final and is consequently no longer considered
silent.

It would seem that this deficiency could only be avoided by
employing a look-up table of all possible compounds that
have medial silent "e"s. This defect also creates another
one for systems which deal with stress, for it will have a
significant effect on stress placement, because stress
placement is based on syllable count. The medial silent "e"
will be seen as a syllable where it is not; the subsequent
incorrect syllable count may cause incorrect stress
placement.
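
The effect on syllable count can be demonstrated with a
deliberately crude counter that treats each run of vowel
letters as one syllable and a word-final "e" after a
consonant as silent. Both the counter and the two-entry
compound table are assumptions made for this illustration;
they are not the syllabification method developed in this
thesis.

```python
import re

VOWEL_GROUP = re.compile(r"[aeiouy]+")

# Hypothetical look-up table of compounds with a medial silent "e".
COMPOUNDS = {
    "scarecrow": ("scare", "crow"),
    "therefore": ("there", "fore"),
}


def naive_syllables(word):
    """Count vowel groups, discounting a final silent 'e' after a consonant."""
    w = word.lower()
    count = len(VOWEL_GROUP.findall(w))
    if count > 1 and w.endswith("e") and w[-2] not in "aeiouy":
        count -= 1
    return count


def syllables_with_table(word):
    """Count each part of a known compound separately, then add the counts."""
    parts = COMPOUNDS.get(word.lower())
    if parts is None:
        return naive_syllables(word)
    return sum(naive_syllables(part) for part in parts)


for w in ("scarecrow", "therefore"):
    print(w, naive_syllables(w), syllables_with_table(w))
```

Without the table both words are counted as three syllables
instead of two, because the medial "e" is no longer final
and so is no longer recognized as silent.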


There is another unit of speech that researchers might look
at that will possibly yield better results. This is the
morpheme. English words tend to have an internal structure,
and their constituent parts are called morphs, which include
prefixes, derivational suffixes, and inflectional suffixes.
They are not limited to just these, but can be free, that
is, a base word such as "truck" and "person". Or they can be
bound, that is, units which must be combined with another
morpheme. All English words consist of morphs.


This is the case because native speakers are much more
attentive to and conscious of morphemes than phonemes.
Morphemes are more crucial in maintaining the semantic
continuity of an utterance; native speakers attend to
strings of phonemes only so far as they are necessary in
constructing these more meaningful morphemic units.

Consequently, any system of rules which attempts to
synthesize speech by considering the morphemic structure of
lexical items in a sentence has come one step closer to
approximating native speaker competence. In addition to
being a more accurate model of the psychological reality of
speech perception, such a system based on morphemes can also
offer a certain facility and accuracy in stress assignment.
This, in turn, improves the speech produced.



A System Based on Morphemes


Setting up a system that uses the morpheme is not a simple
matter. This is due to the fact that, unlike Elovitz's or
Ainsworth's systems, a system based on morphemes must
necessarily involve many more complicated sets of rules.
For this reason, it is natural that most researchers have
steered away from such a task because they want something
that is easily implemented. However, Jonathan Allen [3] of
M.I.T. has endeavored to produce such a system. This is one
of the most advanced and interesting systems for
synthesizing speech that exists.

Allen's rules fall into two main types: those used for the
morpheme system and those comprising the letter-to-sound
system. If a word contains a morph which is not
recognizable, then it is transferred from the morph system
to the letter-to-sound system, where the word is sounded out
in the way that a native speaker would when he or she
encounters a new word.


In the morpheme part of the system an attempt is made to
break up the word into its morphs. This is no simple task,
as can be seen by examining the word "resting". It could be
broken into "re-sting" or "rest-ing". Obviously "rest-ing"
would be the correct choice (or at least, most probable),
and Allen's system will choose this one, because the
inflectional affix (as opposed to the derivational one) is
preferred in this case, as his rules realize. His system
will also correctly pronounce the word "wound" in the
following sentences: "I have multiple wounds" and "I wound
up the string". The correct choice is made because there is
a set of rules which realizes that "wounds" has only an
inflectional affix whereas "wound" has two underlying
inflections, that is, it is really "wind" and "ed" giving
the past tense "wound". However, one can observe that these
kinds of rules are very complicated.
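
The ambiguity of "resting" can be reproduced with a toy
decomposer that tries every prefix and suffix combination
against a small root list. Everything in this sketch is
hypothetical: the three tiny tables, the function names, and
the single preference (avoid a derivational prefix when an
inflectional analysis exists) are simplifications for
illustration and should not be read as Allen's rules.

```python
# Hypothetical miniature morph tables.
ROOTS = {"rest", "sting", "wind"}
PREFIXES = {"re", "un"}            # derivational
INFLECTIONS = {"ing", "ed", "s"}   # inflectional


def segmentations(word):
    """List every (prefix, root, suffix) split whose root is in ROOTS."""
    results = []
    for prefix in [""] + sorted(PREFIXES):
        if not word.startswith(prefix):
            continue
        remainder = word[len(prefix):]
        for suffix in [""] + sorted(INFLECTIONS):
            if not remainder.endswith(suffix):
                continue
            root = remainder[:len(remainder) - len(suffix)]
            if root in ROOTS:
                results.append(tuple(m for m in (prefix, root, suffix) if m))
    return results


def preferred(word):
    """Prefer an analysis that does not begin with a derivational prefix."""
    candidates = segmentations(word)
    if not candidates:
        return None
    return min(candidates, key=lambda morphs: morphs[0] in PREFIXES)


print(segmentations("resting"))
print(preferred("resting"))
```

Even this toy version needs a root dictionary before it can
do anything, which hints at why a full morph-based system
grows so large.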


In the above example, problems still remain when "wound" is
a noun.


Once the pronunciations of the constituent morphs are
found, they must be combined to form the completed word.
Unfortunately, a straight concatenation of the phonetic
transcription of the words as produced by morph analysis or
letter to sound rules cannot be executed, because the morphs
are pronounced according to their location in the word.
Thus, the system incorporates a complete set of stress rules
which are used to determine the accurate pronunciation of
morphemes word-contextually, by making use of facts such as
that suffixes can have a strong effect on the roots to which
they are attached (e.g. "felon/felonious" and
"electric/electricity"). This fine tuning of the phonemic
strings through morphophonemic and lexical stress rules is a
very important difference between this system and others.
However, notice that these rules can only be implemented
once the words are broken up into their morphs.
Accordingly, they cannot be easily implemented in either
Ainsworth's or Elovitz's programs.

This type of fine tuning is often done by humans. Consider
the first time you try to pronounce a word that is being
read which you have never seen before: you tend to sound it
out by breaking it up into parts. The sounding-out is done
either by recognizing the internal morphs or by sounding out
the morphs which you have never seen before, by means of
letter-to-sound rules. Then, once you have sounded out the
parts, you quickly repeat the word, attempting to correct
any sounds which originally were said incorrectly. That is,
you allow for changes in sound that take place due to the
modification of the affixes. This is most apparent in young
children when they are learning to read.

There are also rules that add the prosodic element to the
synthesized speech. Naturally, there are also rules which
calculate the parameters for the parametric synthesizer
being used.

Whilst there can be no doubt that Allen's system is
certainly a good one, incorporating all known aspects of
speech production, it is not all that practical to implement
on a small microcomputer.

The morph dictionary alone contains over 12,000 entries! In
addition, there are all the other rules which need to be
taken into account.

Therefore, this system cannot be implemented on equipment
that the average individual could afford to buy, unless
there is a huge market for it.

Price becomes a prime consideration in most cases. In
particular, for a blind person wishing to work with
computers, speech output would certainly make life easier.
The present alternative, a much slower procedure, is the use
of expensive Braille terminals.

If a talking terminal is inexpensive then a company hiring
a blind individual would not object to buying one.
Conversely, a very expensive one diminishes the likelihood
of the company obtaining one.


So far, the assumption has been that to be of any use
computer speech must sound exactly the same as a native
speaker. This is not necessarily true. If the application
is not one of a life and death situation, mistakes in the
English pronunciation are tolerable. Humans possess very
versatile data gathering mechanisms and will readily correct
many of the errors that the machine may make, within limits.
Naturally, if most words are mispronounced then the system
is useless. Absolute monotone, of course, should be avoided
for long passages, but it may be tolerated in certain
instances.


The algorithm for devices incorporating unlimited speech
output, such as a reading machine for the blind, must be
compact, yet fairly accurate on the more common words.
This is necessary to keep the overall price down and make
this device readily accessible to the general public. It
may require some training, but it should probably not take
more than an hour or so to become reasonably proficient at
understanding the generated synthetic speech.


In the next chapter I would like to discuss a system that
attempts to meet the above criteria. Elovitz's system is
too simplistic. Allen's system is much too complex for
practical purposes. The new algorithm presented herein is
considerably easier to implement than Allen's, but much more
complex than Elovitz's.

A special reference must be made to a study which has just
come to hand (in the course of this writing) by Susan Hertz
[9], whose aim is to produce a speech system that is
accurate and compact. It incorporates a three-level
strategy.

First, the word goes through a set of text modification
rules that modify the orthographic representation of the
word. For example, the "e" that is dropped when adding
"ing" to "care" is added back. Also, a feature matrix is
associated with each word, containing information as to
whether each letter is vocalic or a consonant. This
modified text is then passed through the conversion rules
which transform it into a string of phonemes and an
associated feature matrix. Information as to whether the
sound is frontal, velar, alveolar, etc., is stored in the
feature matrix. The resultant phonetic string is passed
through the feature modification rules that add and delete
segments, and modifications to the feature matrix are made.
It should be noted that these rules also mark the stress.

Finally, the string of phonemes and the feature matrix are
passed through a set of rules that drive a parametric
speech synthesizer.

This is an interesting system, but it does require the use
of a parametric synthesizer; otherwise, the information in
the feature matrix cannot be used. Although the system may
be adequate, it does not easily lend itself to running the
Votrax synthesizer.


Calculation of the parameters to drive the parametric
synthesizer may be too much for a microprocessor to handle.
It is interesting to note that the result of passing a word
through her text modification rules is similar to the first
part of the algorithm used in this thesis, even though her
method is different.









CHAPTER III

THE ALGORITHM

Initial attempts to formulate a set of text-to-speech rules
disclosed that transcription of the vowels to their phonetic
representation would pose problems if done only on the basis
of graphemes. It appeared that much of the vowels' variation
in pronunciation actually depended more on their stress than
on their immediate segmental environment. With this in mind,
it is self-evident that if the word stress can be
determined, then its transcription should become more
accurate with a reduced set of vowel rules. This is further
reinforced by the knowledge that most unstressed vowels
reduce to schwa and are therefore uniform across words.

The ability to assign stress implies two things.
First, that the number of syllables can be determined for
each word; second, that the stress can be assigned to the
correct syllable regardless of any affixes attached to the
root word.

Once the stress is determined, a set of vowel rules must
employ these data, and another set of rules must handle all
consonants and unstressed vowels.

These concepts are basic to the text-to-speech
algorithm.

First, a word is read in, a word being defined as a
string of letters delimited by any non-letter, except in
certain cases. The word is then checked against a
prelexicon and, if found, its pronunciation is obtained from
the prelexicon. If not, the process continues.

Syllabification is then performed, followed by the removal
of external word affixes. The prelexicon and the lexicon
are now consulted to see if the root word is contained in
either. If so, the root word's pronunciation is obtained
from the lexicon. If not, stress is assigned.

The root word is then translated to a stream of
phonemes utilizing the text-to-speech rules for stressed and
unstressed vowels, and consonants.

Finally, any affixes that were removed are glued back
onto the word in their phonetic representation. The affixes
are transcribed to their phonetic realization using the same
set of rules as the root word.
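
The order of operations just described can be summarized in a short Python sketch. It is only an illustration of the control flow: the function name `pronounce`, the callables `strip_affixes` and `rules`, and the toy helpers below are hypothetical stand-ins, not the routines of the actual system, and the "phoneme" strings are placeholders rather than Votrax codes.

```python
from typing import Callable, Dict, Tuple


def pronounce(
    word: str,
    prelexicon: Dict[str, str],
    lexicon: Dict[str, str],
    strip_affixes: Callable[[str], Tuple[str, str, str]],
    rules: Callable[[str], str],
) -> str:
    """Return a phoneme string for one word, in the order the overview gives."""
    word = word.lower()
    # 1) Whole-word check against the prelexicon.
    if word in prelexicon:
        return prelexicon[word]
    # 2) Syllabification and external affix removal (delegated to a helper).
    prefix, root, suffix = strip_affixes(word)
    # 3) Dictionary lookup on the root; otherwise stress + letter-to-sound rules.
    if root in prelexicon:
        root_phonemes = prelexicon[root]
    elif root in lexicon:
        root_phonemes = lexicon[root]
    else:
        root_phonemes = rules(root)
    # 4) Affixes are transcribed by the same rules and glued back on.
    parts = [
        rules(prefix) if prefix else "",
        root_phonemes,
        rules(suffix) if suffix else "",
    ]
    return " ".join(p for p in parts if p)


def toy_strip(word: str) -> Tuple[str, str, str]:
    """Hypothetical affix stripper: only handles a plural in 'ies'."""
    if word.endswith("ies"):
        return "", word[:-3] + "y", "s"
    return "", word, ""


def toy_rules(text: str) -> str:
    """Hypothetical rule engine: wraps the text so the path taken is visible."""
    return "<" + text + ">"


print(pronounce("injuries", {}, {"injury": "LEX:injury"}, toy_strip, toy_rules))
```

With these toy helpers, "injuries" takes the lexicon path for its root and the rule path for its suffix, which mirrors the "injury" example discussed later in this chapter.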


\ 
Now let us examine the process in more detail.


Syllabification 

Analyzing a word into its component syllables is an
easy task for the trivial case of monosyllabic words and
perhaps for bisyllabic words. Beyond this, the process
becomes more difficult because of the inconsistency of the
English language. However, since the system needs to know
only which vowels are in different syllables, for stress
purposes, the process becomes much simpler: we need not be
concerned about clustering the consonants with the correct
vowels. With this in mind, it is sufficient to define
syllable number operationally as the number of vowel
clusters in a word.

Accordingly, the number of syllables in a word can be
determined by breaking up the word into vowel and consonant
clusters purely on the basis of their occurrence in the word
string. For example, the word "bookmart" would be divided
into /b/oo/km/a/rt/. Exactly two vowel clusters are
obtained, which is precisely the number of syllables in the
word. Words such as "creation" and "compute" would fail
using this method. "Creation" becomes /cr/ea/t/io/n/ with
two vowel clusters, but has three syllables. "Compute"
becomes /c/o/mp/u/t/e/ with three vowel clusters, but has
only two syllables. To syllabify these words correctly
requires rules that determine which vowel groups are
diphthongs and which are not. Furthermore, silent "e"s must
be recognized.
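
The naive cluster count described above, together with its two failure cases, can be reproduced with a few lines of Python. This is a minimal sketch that treats only a, e, i, o and u as vowels; the function names are illustrative and not part of the original system.

```python
import re


def clusters(word: str) -> list:
    """Split a word into alternating vowel and consonant clusters."""
    return re.findall(r"[aeiou]+|[^aeiou]+", word.lower())


def naive_syllable_count(word: str) -> int:
    """Count vowel clusters; no diphthong or silent-e handling."""
    return len(re.findall(r"[aeiou]+", word.lower()))


for w in ("bookmart", "creation", "compute"):
    print(w, clusters(w), naive_syllable_count(w))
```

The counts for "creation" (two clusters) and "compute" (three clusters) are the errors that motivate the extra rules that follow.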

To this end, a set of rules has been added which
attempts to do this. However, these rules will not be
correct for a medial silent "e" as in the word "shameful".
(Note: "Shameful" will still be pronounced correctly because
it will be reduced to the root word "shame". Silent "e"
will then be found. See Figure 3.1 at the end of this
chapter.) Thus, "shameful" will be seen as having three
syllables when in fact it has only two.

The actual algorithm used is as follows:

1) Check for word-final silent "e".

   "e" is silent if 1) "e" is not preceded by a, e, i,
                       o, u or l;
                    2) "e" is preceded by al, el, il, ol,
                       ul or yl.

2) Break the word up into vowel and consonant groups.
   (A vowel is any of a, e, i, o, u, and y when y is
   preceded by a consonant. Note that qu and gu are
   considered as one consonant. All other letters
   are consonants.)

3) Break a vowel group into two if the following
   conditions hold. (The break is made immediately
   after the first vowel of the vowel group.
   Note: "$" stands for a, e, i, o, u or y;
         "!" any letter;
         "*" any consonant;
         "#" either a, e, i, o or u.)

   i)   A followed by O.

   ii)  Y followed by I.

   iii) U followed by A and not in persua, gua, jau or qua.
        U followed by E and in tuent, ruen, .lu or luen.
        U followed by I and in uine, uit$, uing, lui or rui.
        U followed by O and not found in quo or uor.
        U in oui.

   iv)  O followed by E and found in oel, oem, oet or oey.
        O followed by I and found in oic.
        O followed by U followed by I.

   v)   I followed by U followed by M.
        I followed by A and not found in liant, tian,
        tial, riage, sian, lian, cial, liam, gia, nia,
        sia, cial, liar, cian, tia or lia.
        I followed by O and not found in cious,
        xious, cion, shio, llio, vior, sion, tion,
        tiou, nio or gio.
        I followed by E and not in tience, cient,
        nience, tient, i*cienc, nient, riend, dier,
        riez, ries, tier, view, piec, iel, ief, mie,
        hie, qie, iev, fie, ie, ieu or yie.

   vi)  E followed by A and found in #*ea, creat, react or
        #cean.
        E followed by O and found in eous, eol, eom, eod,
        neo or reo.
        E followed by U and found in eus, eum or }reu.
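
Steps 1 and 2 of the procedure above can be sketched in Python as follows. This is a hedged reading of the rules, not the original program: the vowel-group splitting of step 3 is omitted, words shorter than three letters are assumed never to have a silent "e", and "qu"/"gu" are assumed to count as one consonant only when a vowel letter follows. All function names are hypothetical.

```python
VOWELS = set("aeiou")


def has_silent_final_e(word: str) -> bool:
    """Step 1: word-final 'e' is silent after a non-vowel other than 'l',
    or after al, el, il, ol, ul, yl."""
    w = word.lower()
    if len(w) < 3 or not w.endswith("e"):  # assumption: very short words exempt
        return False
    before = w[-2]
    if before not in VOWELS and before != "l":
        return True
    return before == "l" and w[-3] in "aeiouy"


def groups(word: str) -> list:
    """Step 2: split into ("V", text) and ("C", text) groups."""
    w = word.lower()
    out = []
    i = 0
    while i < len(w):
        ch = w[i]
        nxt = w[i + 2] if i + 2 < len(w) else ""
        if w[i:i + 2] in ("qu", "gu") and nxt in VOWELS:
            kind, text = "C", w[i:i + 2]  # assumption: only before a vowel
        elif ch in VOWELS or (ch == "y" and out and out[-1][0] == "C"):
            kind, text = "V", ch  # y counts as a vowel after a consonant
        else:
            kind, text = "C", ch
        if out and out[-1][0] == kind:
            out[-1] = (kind, out[-1][1] + text)  # extend the current group
        else:
            out.append((kind, text))
        i += len(text)
    return out


def syllable_count(word: str) -> int:
    """Number of vowel groups once a silent final 'e' is dropped."""
    w = word.lower()
    if has_silent_final_e(w):
        w = w[:-1]
    return sum(1 for kind, _ in groups(w) if kind == "V")


for w in ("compute", "base", "shameful", "table", "whale"):
    print(w, syllable_count(w))
```

Under these assumptions "compute" is counted as two syllables and "shameful" as three, which agrees with the behaviour described in the text; "creation" would still need the splitting rules of step 3.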


Stress Rules 



After determining the number of syllables in any word,
the next step is to utilize this information to determine
which syllable to stress.

Chomsky and Halle's [4] main stress rules figured
prominently in considering the approach to take in stress
determination. They could not, however, be applied directly
because they place much weight on the grammatical class of
the word, which is unknown to the algorithm. Instead, a set
of rules was formulated, based on certain considerations
about stress set down by Axel Wijk [21] in his book "Rules
of Pronunciation for the English Language". His ideas were
refined and formalized into a uniform set of rules.

However, before these rules are applied, all external
affixes are removed. An external affix is defined to be an
affix which does not affect the placement of stress on the
word.

\ 

This makes the assignment of stress easier because
these affixes need not be considered. It also has a further
advantage: it actually corrects mistakes that would
normally be made in the syllabification of some words. For
example, "basement" is seen as having three syllables due to
the silent medial "e" being counted as a syllable. However,
"ment" is an external suffix, so it would be removed,
resulting in "base". "Base" would correctly be seen as a
one-syllable word because its final "e" will be seen as
silent. Pronunciation of "basement" would now be correct
because the system views the word as "base" and "ment" and
will not pronounce the silent "e". Not only does the
removal of external affixes make stress assignment simpler
and more accurate, but it also helps detect silent medial
"e".

Unfortunately, a word such as "bumblebee" will have its
silent medial "e" pronounced, because "bee" is not removed.
However, this is more the exception than the rule.
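
The effect on "basement" can be illustrated with a small Python sketch. It uses a simplified syllable counter (silent final "e" only after a consonant other than "l") and a four-entry subset of the external suffixes listed later in this chapter; both simplifications, and the function names, are assumptions made for the example.

```python
import re

# Subset of the external suffixes from affix removal rule 8 (illustrative only).
EXTERNAL_SUFFIXES = ("ment", "less", "ship", "hood")


def count(word: str) -> int:
    """Simplified syllable count: vowel clusters after dropping a silent final e."""
    w = word.lower()
    if re.search(r"[^aeioul]e$", w):
        w = w[:-1]
    return len(re.findall(r"[aeiou]+", w))


def strip_external_suffix(word: str) -> tuple:
    """Remove one external suffix unless that would leave no syllables."""
    w = word.lower()
    for suffix in EXTERNAL_SUFFIXES:
        if w.endswith(suffix):
            root = w[: -len(suffix)]
            if count(root) >= 1:
                return root, suffix
    return w, ""


root, suffix = strip_external_suffix("basement")
print(count("basement"), root, suffix, count(root))
```

Before stripping, "basement" is counted as three syllables; afterwards the root "base" is counted as one, because its "e" is now word final and recognized as silent. A word such as "bumblebee" is returned unchanged, since none of the listed suffixes applies.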

The affixes to be dealt with, now that the external
affixes have been removed, are those which are internal.
Internal affixes, unlike external ones, do indeed affect
stress. These affixes determine the word's stress depending
on the number of syllables in the word. They are searched
for in a systematic way and the word stress is assigned.

Of course, there are rules which act on words which do
not contain these affixes. They look only at syllable
number to determine stress. The affix removal rules are as
follows. (Note: If not otherwise stated, in rules which cut
off affixes it is assumed that the entire affix is cut off.
If this is not the case, then the part which will not be cut
off will be enclosed in brackets. The rules are applied
linearly. "#" means a, e, i, o or u. "*" means any
consonant. "$" means a, e, i, o, u or y. "!" means any
letter.)



1) For all words beginning with:

   under    para     side     some     micro
   north    east     south    west     up
   down     over     house    sky      out
   air      tele     i et     after    there
   mid      head     where    hydro

   Treat the word as two different words
   provided that at least two letters follow
   the prefix.


2) For all words ending in:

   iably    ings     ably
   * thing  ing      able

   Cut off the suffix, add "ed" and consider the
   word like this for stress. Re-calculate the
   number of syllables the word has.
   (Note: If after cutting off the suffix there
   remain no syllables then ignore this rule.)


3) For all words whose syllable count is greater
   than 2 and which end in:

   ful      ness     (*)ly
   (e)ly

   Cut off the suffix, recalculate the number
   of syllables and change final "ie" to "y".
   This rule is applied twice in succession.

4) For all words ending in:

   cred     vited    qu#*ed   ued      *uned    *eted
   created  uated    *#ded    *eded    uaded    uided
   anged    enged    unged    *ined    *ired    *ared
   *ured    tored    bored    cored    *ated    iated
   nited    cited    *oted    *uted    *iled    oled
   uled     timed    rged     #ped     dged     tiged
   *tiked   kked     pied     zled     tised    nsed
   psed     rsed     died     tibed    bled     tied
   fled     *vied    aled     thed     zed      ved
   ied      ced      *ided

   Cut off the final "d" and re-calculate the
   number of syllables. Change final "ie" to "y".

5) For all words ending in $ed, drop the final "ed"
   from the word and re-calculate the syllable count.
   If it is less than one, disregard this rule.

6) For all words ending in:

   hes      sses     xes

   Take off the final "es". Subtract one from
   the syllable count.

7) For all words, except those which are
   greater than three letters long or ending
   in ss, ending in:

   If       eas      las
   3M       ras      *es
   ees      oes      ues
   ies

   Remove the final "s" and change final "ie" to
   "y". Re-calculate the number of syllables.

8) For all words ending in:

   time     self     more
   teen     teenth   town
   cr       *ety     what
   $one     how      where
   day      ism      ment
   doy      *son     land
   less     man      ton
   thing    body     ship
   out      way      fare
   cast     hood     #over
   room     wood     house
   ground   ball     ever

   Remove the suffix from the word and
   change final "ie" to "y". Re-calculate
   the number of syllables in the word.
   If the number of syllables is less than
   one, ignore this rule.

Of course, an assumption has been made about external
and internal affixes. That is, they can only be one or the
other. This, unfortunately, is not always the case.

In the first stress rule, the suffix "able" is removed
with the assumption that it is an external suffix.
Naturally, in the word "table" it is not. This, fortunately,
presents no problem because the algorithm will not drop the
affix if it reduces a word to zero syllables.

Consider "admirable" and "desirable". Both are
four-syllable words and have the "able" suffix. The problem
is that the "able" suffix is internal to the word
"admirable" and external to "desirable". Undoubtedly,
exception rules can be generated and, of course, some were
used.

Some of these rules do not necessarily reflect the
manner in which native speakers organize their words, but
they seem to work. The idea is to be sure that more words
fit the rule than the exception; otherwise the rule must
not be retained.

It is unacceptable under any circumstance to have a
rule which is applicable to only one case.

Actual realization of stress presents a particular
problem. Ideally, the stressed syllables should exhibit
some rising and falling of pitch, as mentioned by Hyman
[12]. In practice, this is not at all feasible using the
Votrax VS-6.0. Its four levels of inflection are
distributed much too narrowly over the stressed vowel and
the pitch change is far too great. To be done properly,
pitch must be variable by software in very small
increments. This hardware limitation means that the stress
can only be realized by changing the vowel duration. Most,
but not all, vowels can have one of four durations that are
software selectable.

Even though the stress realization is crude, it
appreciably improves the speech produced.

The stress rules are as follows. (Note: "$" stands for
a, e, i, o, u or y; "!" any letter; "*" any consonant; "#"
either a, e, i, o or u. The stress rules are applied in a
linear fashion until the first rule that applies is found,
and the process terminates.)

1) For all three-syllable words ending in:

   ster     ual      y        am       ise
   yze      ize      ate      ite      ect
   ute      ude      cle      ist      enue
   inent    anent    quent    ident

   Stress the first syllable of the word.
2) For all words of two or more syllables ending in:

   ect      oy       een      eur      eer
   oon      ette     esce     aire     tine
   zine     rine     dine     chine    ina
   ona      ana      ita      ever     ota

   Stress the first syllable of the suffix.


3) For all words of three or more syllables ending in:

   go       ma       to       do       ho
   scent    dent     ic       que      ia
   ion      ogy      sive     iod      ior
   ity      ional    ioner

   Stress the syllable before the suffix.

4) For all words beginning with "trans", stress
   the first syllable of the word.

5) For all words of four or more syllables ending in:

   icacy    imacy    inacy    mony     sy
   ary      t<?ry    ory      ator     ize

   Stress the fourth syllable from the end of the word.

6) For all words of four or more syllables ending
   in "al": If one vowel or two or more consonants
   precede the suffix, then stress the syllable before
   the additional vowel or before the two consonants.
   If not, then stress two syllables before the "al".

7) For all words of two or more syllables ending in:

   ence     ency     ous
   #cent    ant      sis
   on       an       ar
   **s      um       us
   a

   If one vowel or two or more consonants precede
   the suffix then stress the syllable before the
   one vowel or two or more consonants. If not,
   then stress two syllables before the suffix.

8) For all words of two or more syllables ending in:

   **le     am       or
   tist     ail      em
   ext      ex       *le
   ile      er       ance
   ie       el       iemnt
   en       fort     ward
   y [but not for "ply"]

   Stress the first syllable of the word.

9) For all words of two or more syllables
   having the following prefixes:

   super    sus      intro
   inter    mis      cro
   suf      en       with
   com      con      de
   dis      dir      div
   em       ef       ig
   f'       ex       imp
   j        in       ob
   oc       opp      per
   pre      pro      pur
   re       sue      sug
   ir       sup      sur
   un       a        be
   forg

   Apply this rule twice, to check for multiple prefixes,
   and stress the syllable after the last prefix found.

10) For all words of three syllables ending in:

    ent      lie

    Stress the first syllable of the word.

11) For all words of four or more syllables
    ending in:

    ture     ible

    Stress the fourth syllable from the end.

12) For all words of four or more syllables
    ending in "er": If one vowel or two or
    more consonants precede "er" then stress
    the syllable before the one vowel or two
    consonants. If not, stress two syllables
    before the "er".

13) For all words of two syllables stress the
    first syllable.

14) For all words of three syllables: stress
    the second syllable if the second syllable
    is heavy, else stress the first syllable.
    (Heavy means one or more vowels followed
    by two or more consonants.)

16) For all words of four or more syllables: stress the
    third syllable from the end of the word.

17) For all words that contain at least one
    syllable stress the first syllable.
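
The purely syllable-count-based fallback rules at the end of the list (rules 13, 14, 16 and 17) can be sketched in Python as below. This is a simplified illustration only: the affix-specific rules 1 to 12 are not modelled, syllables are approximated by runs of a, e, i, o, u, and silent "e" is ignored, so the example words avoid a final "e". The function names are hypothetical.

```python
import re
from typing import Optional


def vc_groups(word: str) -> list:
    """Alternating vowel and consonant clusters (a, e, i, o, u as vowels)."""
    return re.findall(r"[aeiou]+|[^aeiou]+", word.lower())


def default_stress(word: str) -> Optional[int]:
    """Return the 1-based index of the stressed syllable, or None if no vowel."""
    groups = vc_groups(word)
    vowel_positions = [i for i, g in enumerate(groups) if g[0] in "aeiou"]
    n = len(vowel_positions)
    if n == 0:
        return None
    if n == 2:
        return 1  # rule 13: two syllables, stress the first
    if n == 3:
        # rule 14: second syllable is heavy if two or more consonants follow it
        second = vowel_positions[1]
        following = groups[second + 1] if second + 1 < len(groups) else ""
        return 2 if len(following) >= 2 else 1
    if n >= 4:
        return n - 2  # rule 16: third syllable from the end
    return 1  # rule 17: any remaining word with at least one syllable


for w in ("window", "remember", "animal", "america", "cat"):
    print(w, default_stress(w))
```

For instance, "remember" receives stress on its second syllable because "mb" makes that syllable heavy, whereas "animal" falls back to first-syllable stress.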

Text-to-Speech Rules

Once stress has been determined, only one final action
remains to be taken, that is, translation of the text to a
stream of phonetic symbols, which is achieved by use of
text-to-speech rules. The rules are organized by letter and
operate on the grapheme, translating it to a distinct
phoneme.

An attempt has been made to make the number of rules
associated with each grapheme correlate with the number of
phonemically different environments each letter possesses.
That is, an indication of the grapheme's phonological
complexity can be obtained from the number of rules
required to realize all its possible pronunciations.

The appendix contains an explanation of the grammar
used to describe a rule and some examples of the 448
letter-to-sound rules. There are 424 entries in the
lexicon. Let us work through a word. The word "cat" will
be translated in the following way. The "a" in "cat" will
be marked as stressed. Next, the word will be scanned
letter by letter from left to right. First, the system will
look at "c" and try to find a "c" rule. Looking linearly
through the "c" rules, it will try each rule until it finds
one that fits. In this case, the rule [c]=/k/ is used. It
simply means that the letter "c" is pronounced "k" in this
environment. The letters to the right of the equals sign,
"k" in this case, represent the Votrax sounds as marked on
its keyboard and are not IPA symbols. The square brackets
delimit the letter(s) that will be transcribed by the rule.

Next, a rule for the letter "a" must be found among the
stressed "a" rules. [a](c1)#=/ae/ is the one used. It
implies that if "a" is followed by one or more consonants
and they are word final, then "a", because it is the only
letter enclosed in square brackets, is pronounced "ae".
Clearly this rule applies for "at" in "cat".

So far we have translated "ca" as "k ae". The "t" rule
used is [t]=/t/. Thus "cat" would be realized as "k ae t".
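
The walk through "cat" can be mimicked with a very small rule interpreter in Python. The representation below (left context, bracketed letters, right context, output) uses Python regular expressions in place of the thesis's own rule grammar, and the three rules shown are the only ones included; both choices are assumptions made to keep the sketch short.

```python
import re

# Each rule: (left-context regex, letters transcribed, right-context regex, sound).
# Rules are grouped by first letter and tried in order, as in the text.
RULES = {
    "c": [("", "c", "", "k")],                  # [c]=/k/
    "a": [("", "a", r"[^aeiou]+$", "ae")],      # [a](c1)#=/ae/ (stressed "a")
    "t": [("", "t", "", "t")],                  # [t]=/t/
}


def transcribe(word: str) -> str:
    """Scan left to right, applying the first rule that fits at each position."""
    w = word.lower()
    out = []
    i = 0
    while i < len(w):
        for left, target, right, sound in RULES.get(w[i], []):
            if (
                w.startswith(target, i)
                and re.search(left + "$", w[:i])
                and re.match(right, w[i + len(target):])
            ):
                out.append(sound)
                i += len(target)
                break
        else:
            raise ValueError(f"no rule fits {w[i]!r} at position {i}")
    return " ".join(out)


print(transcribe("cat"))
```

With only these three rules the sketch reproduces the transcription "k ae t" derived above; any letter without a matching rule raises an error rather than guessing.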

A rule can be much more complicated than the ones
illustrated. For example, the rule
/g;#be/~[g]/i;e/ ~/ll;rl/ /(v1);n;m;l;r;c;s;#/=/d j/
is certainly valid and is in fact one of the rules. It
means that "g" or word-initial "be" cannot occur before the
"g". The "g" must be followed by an "i" or "e", which in
turn cannot be followed by "ll" or "rl", but must be
followed by one of: one or more vowels; "n"; "m"; "l"; "r";
"c"; "s"; or the end of the word. If the above holds true
then the "g" is pronounced as "d j", as in "gesture". The
reader should note that just as there are "a" stressed
rules, there are also "a" non-stressed rules. Two sets of
rules exist for each vowel, one for when it is stressed and
another for when it is not. Actually, it is a bit more
complicated than that.

The rules that exist under "a" stressed rules would
more correctly be referred to as exception rules. (This
applies to the remaining vowels also.) In fact, there
exists a category of rules called general stress rules that
apply to all stressed vowels. In reality, the following
sequence of events takes place when a stressed vowel is
to be translated to its phonetic description. For purposes
of convenience, let us use the vowel "a". First, it is
checked against the "a" stressed rules; if no rule is
found, it is compared to the general stress rules. If still
no rule is found, then the vowel is given its short
pronunciation. An explanation of the structure of the
general stress rules will serve to clarify what is meant by
"short pronunciation".

One of the general stress rules is [(v1,1)]#=L.
Simply, it means that if the stressed vowel at which we are
looking is word final then it is given its long
pronunciation. The definition of long and short
pronunciation is contained in Table 3-1. The correct vowel
sound is simply chosen from the table. Long "e" would be
pronounced "e" as opposed to "eh" for short "e". This
method allows one rule to cover all vowels and allows each
to have a distinct sound.
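
The three-stage lookup for a stressed vowel (exception rules, then general stress rules, then the short pronunciation) can be sketched as follows. The table holds only the long and short sounds for "e" that the text itself gives; the rule functions and their signatures are hypothetical, and the single general rule shown is the word-final rule quoted above.

```python
from typing import Callable, List, Optional

# Only the entries stated in the text; the full Table 3-1 is not reproduced here.
LENGTH_TABLE = {"e": {"long": "e", "short": "eh"}}

Rule = Callable[[str, int], Optional[str]]


def word_final_is_long(word: str, pos: int) -> Optional[str]:
    """General stress rule: a word-final stressed vowel is long."""
    return "long" if pos == len(word) - 1 else None


def stressed_vowel_sound(
    word: str, pos: int, exception_rules: List[Rule], general_rules: List[Rule]
) -> str:
    """Try exception rules, then general rules, then fall back to short."""
    vowel = word[pos]
    for rule in exception_rules:       # vowel-specific rules give a sound directly
        sound = rule(word, pos)
        if sound is not None:
            return sound
    for rule in general_rules:         # general rules only decide long or short
        length = rule(word, pos)
        if length is not None:
            return LENGTH_TABLE[vowel][length]
    return LENGTH_TABLE[vowel]["short"]


print(stressed_vowel_sound("me", 1, [], [word_final_is_long]))
print(stressed_vowel_sound("bed", 1, [], [word_final_is_long]))
```

Because the general rules return only "long" or "short", one rule serves every vowel, and the table supplies the vowel-specific sound.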

[wr]=/r/ and [w]=/w/. The first one says "wr" is pronounced
as "r" and the second one says the "w" is pronounced as "w".
Obviously, if [w]=/w/ occurred first then [wr]=/r/ could
never be realized. Thus, the rules proceed from the specific
to the general.
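
The consequence of rule order under a first-match search can be shown directly. In this sketch a rule is reduced to a pair (letters, sound) with no context, and `first_match` is a hypothetical helper name.

```python
from typing import List, Optional, Tuple


def first_match(
    rules: List[Tuple[str, str]], text: str, pos: int
) -> Optional[Tuple[str, str]]:
    """Return the first (letters, sound) rule whose letters occur at pos."""
    for letters, sound in rules:
        if text.startswith(letters, pos):
            return letters, sound
    return None


SPECIFIC_FIRST = [("wr", "r"), ("w", "w")]
GENERAL_FIRST = [("w", "w"), ("wr", "r")]

print(first_match(SPECIFIC_FIRST, "write", 0))  # the "wr" rule fires
print(first_match(GENERAL_FIRST, "write", 0))   # the "w" rule shadows "wr"
```

With the general rule listed first, the "wr" rule can never be reached, which is why the rules must be ordered from specific to general.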


Lexicon and Prelexicon 


There are a number of words which are exceptions to the
rules. That is, a rule formulated to cover the word would
only be applicable to that word. Now, because the rules
must be searched in a linear fashion, it would be a waste of
time to include exceptions in the rules. Because of the
nature of the rules, these exceptions would generally have
to be the first entries in the particular letter rule, which
implies that this rule will be scanned every time a letter
is encountered which is the first letter of the exception
word.

Since we know this rule can only be executed given a
particular word, it makes more sense to place it in a
dictionary or lexicon. Then a binary search of words
starting with the same letter as the one in question could
be done to see if the word is or is not an exception. This
method is faster than searching linearly through the
rules. For this reason a dictionary has been included.
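
A binary search over a sorted word list can be sketched with Python's standard `bisect` module. The entries other than "injury" and the pronunciation strings are hypothetical placeholders, and for brevity the sketch searches one sorted list rather than a separate section per initial letter.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

# Placeholder lexicon: (word, pronunciation). Only "injury" comes from the text.
LEXICON: List[Tuple[str, str]] = sorted(
    [
        ("answer", "<pronunciation of answer>"),
        ("injury", "<pronunciation of injury>"),
        ("ocean", "<pronunciation of ocean>"),
    ]
)
WORDS = [word for word, _ in LEXICON]


def lookup(word: str) -> Optional[str]:
    """Binary search for an exact match; None means 'not an exception'."""
    i = bisect_left(WORDS, word)
    if i < len(WORDS) and WORDS[i] == word:
        return LEXICON[i][1]
    return None


print(lookup("injury"))
print(lookup("injuries"))
```

A lookup costs a number of comparisons proportional to the logarithm of the lexicon size, whereas placing each exception at the head of a letter's rule list would add a comparison to every word beginning with that letter.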

The word "injury" is in the lexicon. Note that the word
"injuries" is not. This is so because the lexicon is
searched after external affixes are removed and "injuries"
would have become "injury" after removing the suffix "s".
The pronunciation of "injury" would be accessed from the
lexicon and the pronunciation of "s" would be obtained from
the letter-to-sound rules.

Generally speaking, it suffices then to place only the
root word in the lexicon.

One particular problem exists when a suffix is both
external and internal, as in the case of "land". "Islands"
presents a problem. It will be broken into "is" and "land".
Even if it is put in the lexicon it will never be found
because it will have become "is". This means the lexicon
must first be searched. However, so few words fall into
this category, fewer than forty, that it makes more sense to
define a prelexicon so that there are fewer words to
search.


Intonation

As mentioned before, the Votrax synthesizer being used
has very limited intonation control. Little can be done to
add the prosodic element to the output speech. The speech
produced is very monotonic and it is difficult to determine
where one sentence ends and the next one begins.

In a bid to alleviate this problem the following was
added: any time a period, exclamation mark or comma is
encountered, the pitch of the word preceding the punctuation
mark is lowered from the stressed vowel onwards. Pitch is
raised for a question mark. The punctuation marks mentioned
are transcribed as a pause.


Naturally, the pause generated for a comma is of a
shorter duration than for the other cases. These pauses are
in addition to the pause generated between words, which is
of a very short duration.
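
The punctuation handling just described amounts to a small lookup table. The Python sketch below tags each word with a pitch change and the pause that follows it; the labels "lower", "raise", "long", "short" and "word" are illustrative names rather than actual Votrax control codes, and the pitch change is recorded per word instead of from the stressed vowel onwards.

```python
import re

# Punctuation mark -> (pitch change on the preceding word, pause length).
PUNCTUATION = {
    ".": ("lower", "long"),
    "!": ("lower", "long"),
    ",": ("lower", "short"),
    "?": ("raise", "long"),
}


def annotate(text: str) -> list:
    """Return (word, pitch_change, pause) triples for a stretch of text."""
    tokens = re.findall(r"[A-Za-z]+|[.,!?]", text)
    out = []
    for token in tokens:
        if token in PUNCTUATION:
            if out:  # attach the mark's effect to the word before it
                pitch, pause = PUNCTUATION[token]
                out[-1] = (out[-1][0], pitch, pause)
        else:
            out.append((token, "level", "word"))  # default inter-word pause
    return out


for item in annotate("Is it late? Yes, it is."):
    print(item)
```

In this example "late" is raised before a long pause, "Yes" is lowered before a short pause, and the final "is" is lowered before a long pause.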


This minor enhancement has dramatically improved the
overall quality of the speech for long messages.

This ends the explanation of the new algorithm. Figures
3.1 and 3.2 show examples of passing a word through the
described algorithm. Notice that Figure 3.1 contains the
example of the word "shameful" described earlier in the
chapter.

In the next chapter the performance of the algorithm
presented here will be given. However, it should be noted
that it is meaningless to discuss the accuracy of the
component parts because individually they do not add up to
the final accuracy of the system. This is because each part
takes into account the errors that the other parts may make.

FIGURE 3.1

EXAMPLE 1 (using the word "shameful")

WORD? SHAMEFUL

THERE IS/ARE 3 SYLLABLE(S)

PREFIX/SUFFIX RULE(S) NUMBER(S)-3
1 SYLLABLE(S) IS/ARE CONSIDERED IN RULE
AFTER STRESS RULE, WORD BECAME? SHAME
FIRST SYLLABLE OF THE WORD IS STRESSED? A
RULE 26

MAIN WORD
[SH]=/SH/
[(V1,1)](C1,1)/Y;AIL;IA;IO;IU;EA;EO;EU;IE;AL#;
OT#;IVE;ILE;OR#;OLAT;E/=L
[M]=/M/
(C1)[E]/#;S#/=//

SUFFIX
%[FUL]#=/F UH3 L/

FIGURE 3.2

EXAMPLE 2 (using the word "computers")

WORD? COMPUTERS

C  C
V  O
C  MP
V  U
C  T
V  E
C  RS

THERE IS/ARE 3 SYLLABLE(S)

PREFIX/SUFFIX RULE(S) NUMBER(S)-7
3 SYLLABLE(S) IS/ARE CONSIDERED IN RULE
AFTER STRESS RULE, WORD BECAME? COMPUTER
STRESS 1 AFTER THE PREFIX %U
RULE 19

MAIN WORD
[C]=/K/
[/O(S1);O/]=/UH3/
[M]=/M/
[P]=/P/
(C1)/TH;ST;SH;CH;N;X;Y;S;Z;J;L;R/~[U](C1,1)(V1)=/Y U/
[T]=/T/
[/E(V1);E/]=/EH3/
[R]=/R/

SUFFIX
(C1)/T/~[S]#=/Z/

CHAPTER IV

PERFORMANCE

The first five thousand word entries in the Brown
Corpus [13] will be correctly pronounced by the system
presented in this thesis, within the hardware limitations of
the Votrax VS-6.0 synthesizer.

This was ascertained by listening to the synthesized
version of each word and by looking at the phonetic
transcription that was being produced to drive the
synthesizer.

A word was deemed correctly pronounced if its
pronunciation was similar to that of most Canadians or to
that given in Webster's Dictionary.

It would have been desirable to be able to check
the phonetic transcription of each word directly against a
dictionary, thereby giving an undisputed measure of
accuracy. This cannot be done, because there is not an
exact correspondence between Votrax phonetic symbols and IPA
symbols. In fact, built into the synthesizer there is an
algorithm that affects the sound of the phoneme depending on
its phonetic environment, and it does not always do as one
expects.

The problem of vowel duration still remains. The
Votrax VS-6.0 generally has at least three, sometimes four,
different vowel durations. Only one of them is most
suitable for any given word. Information as to vowel
duration is not generally given in dictionaries.

The sum of the accuracy of the component parts of the
system is not equal to the final accuracy of the system due
to the system's design, as mentioned earlier in Chapter 3.
For this reason, it is meaningless to give accuracies for
the component parts, especially considering that they were
not designed as stand-alone units. (The accuracy of
syllabification is approximately 90%, stress assignment is
about 85% and the text-to-speech rules about 92% on the
first 5,000 words of the Brown Corpus.)


The accuracy of any synthetic speech system on the
first five thousand words in the Brown Corpus is not
necessarily a good indication of how it will be received by
the general public. With this in mind, an experiment was
set up to get an indication of the performance of the system
as judged by naive users.



The "Type 'N Talk" Synthesizer

Recently, another synthesizer was added to Concordia
University's speech lab, the "Type 'N Talk". This
synthesizer is based on the SC-01 chip and its repertoire of
phonemes is nearly identical to that of the Votrax VS-6.0.
Both synthesizers are made by the same company. However, the
"Type 'N Talk" contains its own built-in algorithm for
text-to-speech conversion. It is a very small, inexpensive
unit and therefore contains a minimal text-to-speech
algorithm. Its low price would certainly make it a
contender for use in a reading machine for the blind. Since
it uses an almost identical synthesizer to the one used
in this research, it was decided to include synthesized
speech from this unit in the performance tests. Granted, it
employs an algorithm smaller than the one described in this
research and therefore should probably not do as well.
However, this is not the main concern.

The idea in this thesis was to produce a small
algorithm that could be implemented on a microcomputer. If,
however, the new algorithm does not fare much better in the
performance test than the "Type 'N Talk", it would indicate
that a fairly simple algorithm is all that is needed for
reasonable speech intelligibility.

Design of the Performance Test

The aim of the performance test, in the context of this
thesis, was to obtain an indication of the intelligibility of
the speech generated by the system described in this thesis.
From now on the system will be referred to as C.S.P.
(Concordia Speech Project).

Since further work is planned on C.S.P., an experiment
that generated much more data than required for the purposes
of this chapter was done. This would allow for more
complicated analysis to be performed at a later date.

The experiment consisted of giving a dictation of nine
lists of ten individual, unrelated sentences. The sentences
have the property of not allowing prediction of the first or
last part of the sentence based on the other part. Every
English phonetic sound was contained in each list in
the frequency with which it would normally occur. These
lists of phonetically balanced sentences are often referred
to as Harvard Sentences [24].

The nine lists of sentences were presented to three
separate groups of subjects. Three lists were spoken by a
human, three by the "Type 'N Talk", and three by C.S.P.
Table 4.1 shows the lists presented. The order of
presentation of the lists to the subjects was list one
followed by list two and so on. From this table it can be
seen that the order of presentation of the different systems
is counterbalanced and that each list is dictated once by
all three systems.




TABLE 4.1

DISTRIBUTION OF LISTS AMONG GROUPS

                  Group
List          1     2     3
---------------------------
  1           C     T     H
  2           T     H     C
  3           H     C     T

(The rows for lists 4 to 9 are illegible in the source.)

NOTE: H=Human; T="Type 'N Talk"; C=C.S.P.


After the dictation of nine lists of sentences, the
subjects listened to a section of a Time magazine article of
two hundred and forty-five words. Group one listened to the
paragraph given by C.S.P. and group two was given the
version produced by the "Type 'N Talk". Group three
listened to the human version.

They were then asked to write, in one or two sentences,
what the paragraph was about. Following this, they were
expected to give a numerical rating of the presentation of
the paragraph.

The actual question asked was phrased as follows: "At
the end of the paragraph you are about to hear would you
please write one or two sentences that would summarize what
the paragraph was about. Then, give a numerical rating
between one and one hundred, which reflects how you felt
about your own understanding of the paragraph and its
presentation."

The experiment was conducted at the Canadian Armed
Forces' Royal Military College of Saint-Jean. One hundred
and twelve cadets ranging in age from seventeen to
twenty-one were randomly divided up into three equal groups.

Although the participants of the experiment were
supposed to have been native English speakers who had never
heard any synthetic speech before, some cadets not meeting
these criteria took part. Their papers were eliminated,
leaving ninety subjects: twenty-five in group one,
thirty-seven in group two, and twenty-eight in group three.

Each group was presented the dictation and the
paragraph from a pre-recorded tape, in one of three
identical rooms, at the same volume and at a speech rate of
approximately 160 words per minute.

The dictation test for each list took approximately two
minutes. A thirty second break was given between tests,
meaning that each voice would be heard again in about five
minutes. The entire dictation test took twenty-four minutes
to complete.


Results

Table 4.2 and figure 4.1 summarize the results of the
experiment. When viewing table 4.2 it is important to
remember that it is based on naive subjects having only
listened to twelve minutes of synthetic speech by the end of
the complete dictation test. In light of this, the results
are encouraging for C.S.P.


TABLE 4.2

PERCENTAGE OF CORRECT WORDS FOR EACH SYSTEM

                          Trial
System             1        2        3       Mean
-------------------------------------------------
Human             93       92       96        94
C.S.P.            55       59       66        60
Type 'N Talk      16  (illegible)   27        20


Trial one is composed of the percentage of correct words that
each group obtained on exposure to the first list dictated
by one of the three systems. Therefore, the percentage
obtained for the human speaker in trial one is the sum of
the percentage of correct words obtained in list one from
group three, list two from group two and list three from
group one.

The sums are taken this way to cancel the effects that
the order of presentation of the lists may generate.

The same holds true for trials two and three, except that
trial two deals with the second exposure to a particular
system and trial three deals with the third exposure to one
of the three particular systems.
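
The way the trial scores are pooled across groups can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rotation is taken from the text for lists one to three and is assumed to repeat for lists four to nine, the scores are invented, the function and variable names are hypothetical, and the pooled value is computed as a mean (rather than a raw sum) so that it stays a percentage.

```python
from collections import defaultdict

# SCHEDULE[group] gives, list by list, the system that dictated it
# (H = human, T = "Type 'N Talk", C = C.S.P.). Lists one to three follow
# the text; repeating the rotation for lists four to nine is an assumption.
SCHEDULE = {1: "CTH" * 3, 2: "THC" * 3, 3: "HCT" * 3}


def trial_scores(schedule, scores):
    """Return {(system, trial): mean percent correct}.

    scores[group][i] is the percentage of words that the group wrote
    down correctly for list i + 1 (hypothetical input format).
    """
    collected = defaultdict(list)
    for group, systems in schedule.items():
        exposures = defaultdict(int)  # times this group has heard each system
        for index, system in enumerate(systems):
            exposures[system] += 1
            collected[(system, exposures[system])].append(scores[group][index])
    return {key: sum(vals) / len(vals) for key, vals in collected.items()}


# Made-up scores, chosen only so that the result is easy to check by eye.
EXAMPLE_SCORES = {
    1: [50, 10, 90, 60, 20, 92, 70, 30, 94],
    2: [10, 90, 50, 20, 92, 60, 30, 94, 70],
    3: [90, 50, 10, 92, 60, 20, 94, 70, 30],
}
```

Because every system appears once in each list position within a trial, any effect of list order contributes equally to all three systems.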

Figure 4.1 shows that the general trend was to improve 
over time for each of the three voices. Improvement was 
more pronounced for the synthetic voices than for the human. 

A two-way analysis of variance with factors trial (1st,
2nd, 3rd) and system ("Type 'N Talk", C.S.P., Human), with
repeated measures on both factors, was done. The main effect
for the system is F(2,178) = 4231, p < .001. The main effect for
the three trials is F(2,178) = 95, p < .001, while the interaction
between the system and the trials is F(4,356) = 9.96, p < .001.

In the case of the human it is most likely that the
improvement is due to adapting to the testing procedure and
learning to write at the correct speed.

Whilst this may be a factor in the synthetic speech
case, it is also most probable that the cadets were adapting
to the synthetic speech.


TABLE 4.3

PERFORMANCE ON THE PASSAGE
(scores given as percentages)

                           System
Scores        Human      C.S.P.      Type 'N Talk
-------------------------------------------------
Correct        100         97             71
Rating          91    (illegible)         32


Table 4.3 shows the results of the dictation of the
passage. The passage was deemed correctly understood if the
subject got the general idea of what it was about. The
table also contains the average rating given to the system
by the cadets, which roughly corresponds to their
intelligibility scores.

These results would tend to reinforce the feeling that
the synthetic speech system would do better on paragraph
material than on single sentences containing little
contextual information.


FIGURE 4.1

PERCENTAGE OF CORRECT WORDS FOR EACH SYSTEM
IN GRAPH FORM

[Graph: percentage of correct words (0 to 100) plotted against trial (1 to 3) for Human, C.S.P. and "Type 'N Talk".]


Sources of Errors

Preliminary investigation into the causes of errors
indicated that there are three main sources of errors:
semantic, phonetic environment and pronunciation.

Semantic errors occurred in a great many of the lists
given by C.S.P. and to a much lesser extent in those given by
the human. This was not that apparent in the lists done by
the "Type 'N Talk" due to the limited number of words that
the cadets actually comprehended and wrote down.


These errors manifest themselves as changes in the words
used that still produce a correct sentence. In the
sentence "The friendly gang left the drug store", "gang"
was changed to "man" by a great many of the subjects
listening to the C.S.P. version. One generally does not
associate a gang with being friendly and would be more
inclined to expect a man to be so.

"Hop" was changed to "jump" in the sentence "Hop over
the fence and plunge in." by the same group of subjects.
"Hop" and "jump" are highly associated words and the meaning
of the sentence is not changed in this case.

Probably what is happening here is that the subjects are
understanding most of the sentence and attempting to
fill in any missing words according to the semantics of the
sentence. In these sentences such a strategy will
inevitably lead to error.

The phonetic environment also played a role in creating
some transcription errors. If the sentence "The heart beat
strongly and with firm strokes." is said quickly, it is
inevitable that "beat" may be thought to be "beats". This
occurs because the "s" in "strongly" that immediately
follows "beat" may sound as though it is part of both "beat" and
"strongly". In the human case this is precisely an error
that was made. It is unlikely that such an error would have
been made if a time adverbial had been present.

The word "wood" in "Wood is best for making toys and
blocks." was changed to "what" in many instances when it was
presented by C.S.P. This substitution is not all that odd.
In isolation the "t" and "d" sounds would be pronounced
differently. This is not the case in a sentence. When "t"
is followed by a vowel, in this case the "i" from "is", the
"t" is generally pronounced as a "d" because of intervocalic
voicing. The word "butter" is a good example of this.

"Wood is best ..." is a strange way to formulate the
beginning of a sentence, while "What is best ..." is not and
is certainly more common. Given that the two sound very
similar, it is not surprising that the substitution took
place, rendering a different, but perfectly acceptable
sentence.


Pronunciation obviously makes a difference. The word
"tiny" was pronounced as "teeny" by the "Type 'N Talk" and,
not surprisingly, not a single person got it correct. This
is a case of straight incorrect pronunciation (at least as
far as Webster's dictionary is concerned).


There exists a more subtle instance of this problem.
"What" and "wood", from the above example, both contain lax
vowels. Although there is a definite difference between the
two vowels, it is a subtle one, not that distinguishable in
machine speech. In some dialects there is a distinction
made between a "w" and a "wh" sound, but none is made by
C.S.P. It would also seem that none is made by the "Type 'N
Talk".


Article substitution, more specifically "a" for
"the", took place in many instances in the human condition.
In general, humans tend not to pay too much attention to
articles unless they are important. Errors of this kind
are therefore to be expected.

C.S.P. Versus "Type 'N Talk"

The large difference in words correct for C.S.P. and
"Type 'N Talk" was not expected. It would be convenient to
attribute all the difference between the two systems to the
fact that C.S.P. uses a more sophisticated algorithm.
Although this certainly makes a difference, it should not
account for all the difference.

Spectrographs taken of the sentence "The tiny girl took
off her hat.", spoken by all three systems, show some
important differences between systems (figures 4.2, 4.3 and
4.4). The spectrographs were made from the tapes actually
used in the performance test.

The formant transitions produced by the Votrax VS-6.0
synthesizer, used in the C.S.P., tend to be smoother than
those of the "Type 'N Talk". Much more information is
available from the speech produced by the Votrax VS-6.0 at
the higher frequencies than from that of the "Type 'N Talk".
This is probably not attributable to the SC-01 synthesizer
chip used, but rather to the amplifier used. It seems to
have a sharp cut-off point of about 3,000 Hertz. The
amplifier is held suspect because the SC-01 chip in the
evaluation board supplied by Votrax sounds almost identical
to the Votrax VS-6.0 synthesizer. Unfortunately, a
spectrograph of the SC-01 chip in the evaluation board could
not be made at this time to verify this suspicion.

The spectrograph of the human is included for
illustrative purposes. There can be little doubt that the
human voice supplies much more information, although in this
case not all of it is important. Table 4.4 shows the
percentage of words that were correctly recognized by the
subjects for each of the three systems. Little problem was
encountered for the human or C.S.P. cases, but this was not
the case for the "Type 'N Talk". The word "tiny" was not
correctly recognized by any subject for the "Type 'N Talk"
system, but this was probably due to the fact that it was
pronounced "teeny".

Table 4.4 also shows the relative duration of each
vowel for each word in the example sentence for the "Type 'N
Talk", C.S.P. and human. The values in the table are only
approximations, but the ratios between systems are correct.

The duration of the vowels is different for the three
systems. The vowels in "girl" and "took" were given shorter
durations by C.S.P. than by the "Type 'N Talk". For all other
words the reverse is true. This could be due to the lack of
stress assignment in the "Type 'N Talk" algorithm and the
inclusion of stress assignment in C.S.P.

The human vowel duration is generally shorter than that of
either synthetic voice system. Although the vowel duration in the
word "tiny" is the same for the human and the "Type 'N Talk"
condition, it must be remembered that they said two
different words.


TABLE 4.4

PERCENTAGE OF SUBJECTS CORRECTLY IDENTIFYING EACH
WORD IN THE SENTENCE AND THE WORD'S VOWEL DURATION
(in milliseconds)

                                         Word
System                    the  tiny  girl  took   off   her   hat
-----------------------------------------------------------------
C.S.P.         correct     93    93   100   100   100   100   100
               duration   117   468   306   167   234   184   184
Type 'N Talk   correct     64     0    76    52    60    68    36
               duration   100   351   401   200   200   150   167
Human          correct    100    92   100   100   100   100   100
               duration    67   351   384    67    84   117   150
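
The duration comparison made in the text can be checked mechanically. The sketch below is illustrative only: it uses the duration figures as they can be read from Table 4.4 in the scanned source, and the function names are hypothetical.

```python
WORDS = ["the", "tiny", "girl", "took", "off", "her", "hat"]

# Vowel durations in milliseconds, as read from Table 4.4.
DURATION_MS = {
    "C.S.P.": [117, 468, 306, 167, 234, 184, 184],
    "Type 'N Talk": [100, 351, 401, 200, 200, 150, 167],
    "Human": [67, 351, 384, 67, 84, 117, 150],
}


def shorter_in(first, second):
    """Words whose vowel is shorter in system `first` than in `second`."""
    pairs = zip(WORDS, DURATION_MS[first], DURATION_MS[second])
    return [word for word, a, b in pairs if a < b]


def duration_ratio(first, second):
    """Per-word ratio of vowel durations, first / second, to two places."""
    pairs = zip(WORDS, DURATION_MS[first], DURATION_MS[second])
    return {word: round(a / b, 2) for word, a, b in pairs}
```

With these figures, `shorter_in("C.S.P.", "Type 'N Talk")` returns only "girl" and "took", which agrees with the statement made about the two synthetic systems.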
CONCLUDING REMARKS



Synthetic speech has many diverse applications, many of
which require an unlimited vocabulary. In the last few
years various attempts have been made to formulate
algorithms which would allow for an unrestricted vocabulary
without actually attempting to store every word.


Algorithms ranging from very simple to extremely
complex have been generated. The simple algorithms, or sets
of rules, do not produce adequately intelligible speech.
The most complex ones, such as MIT's MITalk system, do
produce fairly intelligible speech, but are too large to be of
practical use at the present time.


In this thesis, a new algorithm has been presented
which is fairly accurate, but still compact enough to run on
very small computers. It extends the idea of
text-to-speech rules by adding the concept of stress
realization. This also has the effect of alleviating some
of the problems caused by silent medial "e".

Performance tests of the system indicate that it
compares favourably to Susan Hertz's system [9]. Her system
gets the first one thousand five hundred words in the Brown
Corpus ninety-six percent correct; C.S.P. gets the first
five thousand words one hundred percent correct.

Although the percentage of correct words recognized for
the Harvard sentences was not as high as one would have
liked, it must be noted that there was definitely an
improvement over time.

It is suspected that the score would improve
appreciably over a few more hours. Persons who have been
exposed to the system for a period of time tend to be able
to understand it fairly well.


Further Improvements and Research That Could be Done

C.S.P. could be markedly improved by taking into account
the prosodic elements of speech. This would vastly increase
the size of the algorithm and substantially increase its
execution time. It would also necessitate the use of a
parametric synthesizer.

Unfortunately, not enough is known about the way in
which humans produce speech to formulate an algorithm
capable of generating perfect human-like speech with an
unlimited vocabulary. Therefore, further research must be
directed at obtaining a more complete picture of human
speech production before machine speech will become almost
indistinguishable from human speech. (Although this can be
done with linear predictive coding, it does not allow for an
unlimited vocabulary.)


In the meantime, the algorithm presented herein could
be used in a low cost synthetic speech system. Although the
speech produced is not completely natural, the important
point is that it is intelligible. Furthermore, the algorithm is
of a reasonable complexity such that it can be implemented
on a fairly small computer.
APPENDIX


GRAMMAR OF THE TEXT-TO-SPEECH RULES

In this appendix some examples of the text-to-speech
rules are given. Chapter 3 contains an example of how they
are used and this appendix is to be used in conjunction with
it.

Each rule consists of two parts. The first part of the
rule, to the left of the equal sign, indicates when the rule
is to be used, while the right hand side of the rule
indicates what phonemes are to be used if the rule holds.

Interpretation of a rule is done in the following
manner. First, the characters within the square brackets
are looked at to see if they are contained in a word. If
so, then scanning continues in a left to right direction
from the closing square bracket until the equal sign is
found in the rule, or a character is found in the word that
does not match the one in the rule. If everything has
matched so far, the scanning continues from the opening
square bracket in a right to left direction until no match
is found, or the end of the rule is found.
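
The scanning order just described can be sketched as follows. This is a minimal illustration, not the thesis implementation: it handles only literal letters and the word-boundary symbol "#", ignores every other symbol of the rule grammar, and all function names are hypothetical.

```python
def parse_rule(rule):
    """Split 'LEFT[TARGET]RIGHT=/PHONEMES/' into its four parts."""
    pattern, phonemes = rule.split("=", 1)
    left, rest = pattern.split("[", 1)
    target, right = rest.split("]", 1)
    return left, target, right, phonemes.strip("/").split()


def char_matches(symbol, word, index):
    """Match one rule symbol against word[index]; '#' matches a word edge."""
    if symbol == "#":
        return index < 0 or index >= len(word)
    return 0 <= index < len(word) and word[index] == symbol


def apply_rule(rule, word, start):
    """Return (phonemes, letters_consumed) if the rule holds at word[start]."""
    left, target, right, phonemes = parse_rule(rule)
    # Step 1: the bracketed letters must be present in the word.
    if not word.startswith(target, start):
        return None
    # Step 2: scan left to right from the closing bracket.
    after = start + len(target)
    for offset, symbol in enumerate(right):
        if not char_matches(symbol, word, after + offset):
            return None
    # Step 3: scan right to left from the opening bracket.
    for offset, symbol in enumerate(reversed(left)):
        if not char_matches(symbol, word, start - 1 - offset):
            return None
    return phonemes, len(target)
```

For instance, `apply_rule("[ATE]#=/A T/", "LATE", 1)` yields the phonemes A and T and skips three letters, while the same rule fails on "LATER" because the word does not end after "ATE".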

In some cases, instead of a letter, a symbol is used to
represent a class of letters or a condition. The following
list will explain all the symbols used:

%  = Use this rule only for suffixes.

!  = Any letter.

#  = Word final or initial.

?  = Any one vowel.

[] = Delimits the letters to be skipped if the rule
     holds.

=  = Separates the matching part of the rule from
     the part which gives the phonetic sound.

/a1;a2;a3/ = Any one of the conditions a1, a2, or a3
     must hold.
     There is no limit on the number of
     conditions that may be used, except
     that a rule must fit on one line.

^  = It is used to complement the above expression.
     /B/^[A]^/C//T/ means "A" not preceded by "B" or
     followed by "C", but must be followed by "T".
     Remember the reversing scanning pattern when
     reading rules within them.


(ZI,M) or (ZI)

     I = At least "I" Z's.

     M = No more than "M" Z's.

     Z can be: V = Any vowel.

               C = Any consonant.

               D = Any vowel provided it is
                   not in the same vowel cluster
                   as the vowel immediately
                   preceding this command.

               S = Any vowel provided it is
                   in the same vowel cluster
                   as the vowel immediately
                   preceding this command.

               N = Any vowel which is not
                   stressed.
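
As an illustration of the (ZI,M) notation, the sketch below matches a run of vowels or consonants. It rests on several assumptions that the appendix does not settle: the vowel set (here A, E, I, O, U and Y), greedy matching that takes at most M letters even when the run in the word is longer, and support for the V and C classes only. The function name is hypothetical.

```python
VOWELS = set("AEIOUY")  # assumed vowel set; not specified in this appendix


def match_run(kind, at_least, at_most, word, start):
    """Match (ZI,M) at word[start], where kind is 'V' or 'C'.

    Returns the number of letters consumed, or None if fewer than
    `at_least` letters of the class are found. Pass at_most=None for
    the (ZI) form, which sets no upper limit.
    """
    def in_class(ch):
        is_vowel = ch in VOWELS
        return is_vowel if kind == "V" else (ch.isalpha() and not is_vowel)

    count = 0
    while start + count < len(word) and in_class(word[start + count]):
        if at_most is not None and count == at_most:
            break  # upper limit reached; leave the rest unmatched
        count += 1
    return count if count >= at_least else None
```

Under these assumptions, (V1,1) applied at the "E" of "BEAT" consumes one letter, and (C1) applied at the start of "STRONG" consumes the three consonants "STR".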






Some example rules follow:

********* A RULES

%[ABLY]#=/UH3 B L E/
%[ABLE]#=/UH3 B UH3 L/
[AIL]=/A1 1/
[ATOR]#=/A1 T EH3 R/
[ATE]#=/A T/
[AY]=/A Y/
[AU]=/AW2/
[A]#=/UH1/
[./A<S1); A/3=/UH3/

********* B RULES

B[B]=//
[BODY]#=/B AH1 D E1/
%[BALL]#=/B AW L/
[BI]O=/B AH1 AY/
[BASIC]=/B A S EH3 ...
#[B]#=/B E/
[BEEN]=/B I N/
[BROAD]=/B R AW D/
[BUIL]=/B I L/
[B]=/B/

.['B ] s /B/ 

********* GENERAL STRESSED RULES

[(V1,1)](C1,1)ER(V1)=S
[(V1,1)]^/’LL//X;LL;BtlC;TRIC;(C1,1)EDY#/=S
[I](C1,1)/IOU;IA;IO;IU;EA;EO;EU;IE/=S
[(V1,1)](C1,1)IT#=S
[(V1,1)]VE(C1)=S
[(V1,1)](C1,1)E/L;T/=S
[A]DE(C1)=S
[?](V1)=L
[(V1,1)]/#;»D#;GH;SURE;»LL/=L
[(V1,1)]SIS#=L
[(V1,1)](C1,1)/ATE#;IOUS/=L
[(V1)]^/ND/(C1)OUS=L
[(V1)](C1,1)YS#=L
[/I;Y/](C1,1)R=L
[(V1,1)]^/NRY/>/(C1,1)R(V
[(V1,1)]CRE=L
[(V1,1)](C1,1)/Y;AIL;IA;IO;IU;EA;EO;EU;IE;AL\0
OT?;IVE;ILE;OR#;OLAT;E/=L
[(V1,1)](C1,1)LE#=L
[U](C1,1)(V1)=L




